Feat/mle bench evaluation #5148
This PR is currently blocked by #4848, reproducible as follows:
&& wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
&& bash /tmp/miniconda.sh -b -p /opt/conda \
&& rm /tmp/miniconda.sh \
&& /opt/conda/bin/conda init
Just a note here, not necessarily for today, but IMHO for the PR overall: we don't use Anaconda's repos and channels anywhere in the codebase, afaik. We have replaced them with miniforge and micromamba (example). These two are compatible with conda, and we can set the channel to community repositories.
The reason is Anaconda's weird licensing. It's not open source, and while it doesn't have unexpected terms for individuals or academia (iirc), it does for companies with 200 or more employees. Please note also that the `-b` parameter of the script runs it silently, which, according to Anaconda, means that the user is "assumed to have agreed". (I think that if we must use this for some reason, we'd need to notify people in some very clear way.)
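For illustration, a rough sketch of what the swap to Miniforge could look like, keeping the same `/opt/conda` prefix as the lines under review. The two function names are hypothetical helpers, not anything in this repo. The URL pattern is assumed to follow the `Miniforge3-$(uname)-$(uname -m).sh` naming used on the conda-forge/miniforge release page, and the installer flags are assumed to behave like the Miniconda ones.

```shell
#!/bin/sh
# Hypothetical helper: build the Miniforge installer URL for an OS/arch pair.
# Assumption: release assets are named Miniforge3-<uname>-<uname -m>.sh.
miniforge_url() {
    os="${1:-$(uname)}"
    arch="${2:-$(uname -m)}"
    echo "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-${os}-${arch}.sh"
}

# Hypothetical helper: download and install into a prefix (default /opt/conda).
# Only defined here, never called, so sourcing this file touches nothing.
install_miniforge() {
    prefix="${1:-/opt/conda}"
    wget "$(miniforge_url)" -O /tmp/miniforge.sh \
        && bash /tmp/miniforge.sh -b -p "$prefix" \
        && rm /tmp/miniforge.sh \
        && "$prefix/bin/conda" init
}
```

In a Dockerfile, the body of `install_miniforge` would replace the four `&&` lines above more or less one for one; only the download URL and the temp file name change.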
That's good to know, I'll make the switch where I can. The only reason it's included here is that the MLE-bench base agent sandbox images use it to manage a virtual environment.
This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This PR was closed because it has been stalled for over 30 days with no activity.
@csmith49 The automated checks closed this PR. I reopened it to keep the status quo, but please feel free to do as you see fit.
Talked to Calvin. Going to close this and he will reopen once this is ready again.
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.
The MLE-bench implementation provides:
`agent` definition format.
The goal of this PR is to re-use as much existing infrastructure as possible by providing a suitable OpenHands `agent` definition. However, only the scripts from 1. are exposed as a Python package, so we assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we need to re-implement some minor scaffolding around `agent` definitions to allow for benchmarking from this repo.
Link of any specific issues this addresses
#4328